actor
According to the wiki page, we can get rid of these columns: standard_text_property, count_text_property, concat_names.

| | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14521 | 48893 | Actr48893 | Reginaldo - da Genova | 1510.0 | 3 | None | NaN | None | None | 1 | None | 104.0 | 30.0 | 2014-04-05 17:30:35.800 | 30.0 | 2014-04-05 17:30:36 |
| 36706 | 40238 | Actr40238 | Jacquet, Jean Bernardin | 1831.0 | 1 | None | 1881.0 | 1 | None | 1 | None | 104.0 | 24.0 | 2010-11-18 11:09:01.000 | 24.0 | 2013-12-18 15:24:16 |
| 38719 | 43209 | Actr43209 | Albinus, Elisabeth | 1595.0 | 1 | 2 | 1666.0 | 1 | 2 | 2 | None | 104.0 | 25.0 | 2011-05-26 11:53:19.000 | 25.0 | 2013-12-18 15:24:16 |
| 1025 | 14468 | Actr14468 | Denix, Guillaume | 1673.0 | 1 | None | 1673.0 | 1 | None | 1 | None | 104.0 | 28.0 | 2008-12-04 16:35:47.000 | 11.0 | 2013-12-18 15:35:49 |
| 1292 | 14720 | Actr14720 | Frundin, Anne | 1673.0 | 1 | None | NaN | 1 | None | 2 | None | 104.0 | 28.0 | 2008-12-04 16:35:49.000 | 11.0 | 2013-12-18 15:35:49 |
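Dropping the three columns named above can be sketched as follows. This is a minimal example on a hypothetical two-row frame: only the column names come from the text, and the variable name `actor` and the placeholder cell values are assumptions.

```python
import pandas as pd

# Hypothetical miniature of the raw table; the values in the three
# dropped columns are placeholders, not real data.
actor = pd.DataFrame({
    "pk_actor": [48893, 40238],
    "concat_standard_name": ["Reginaldo - da Genova", "Jacquet, Jean Bernardin"],
    "standard_text_property": [None, None],
    "count_text_property": [0, 0],
    "concat_names": ["", ""],
})

columns_to_drop = ["standard_text_property", "count_text_property", "concat_names"]

# errors="ignore" keeps the cell re-runnable once the columns are already gone.
actor = actor.drop(columns=columns_to_drop, errors="ignore")
print(list(actor.columns))
```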
Some rows have been flagged as not to be imported. They can be recognized by the string "[à identifier]" (French for "to be identified") in the column concat_standard_name.
Number of rows before filter: 61556
Number of rows after filter: 59625 (1931 removed)
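A filter of this kind could look like the sketch below. The sample rows are hypothetical (including the exact shape of the flagged name); only the marker string and the column name come from the text.

```python
import pandas as pd

# Hypothetical sample; the second name is an invented flagged row.
actor = pd.DataFrame({
    "pk_actor": [1, 2, 3],
    "concat_standard_name": ["Denix, Guillaume", "[à identifier] 12", "Frundin, Anne"],
})

rows_before = len(actor)

# regex=False: the square brackets are matched literally, not as a character class.
to_skip = actor["concat_standard_name"].str.contains(
    "[à identifier]", regex=False, na=False
)
actor = actor[~to_skip]

rows_after = len(actor)
print(f"Rows before filter: {rows_before}, after filter: {rows_after} "
      f"({rows_before - rows_after} removed)")
```

Passing `regex=False` matters here: as a regular expression, the marker would be read as a character class and match almost every name.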
For now, we are only interested in persons.
Persons are the rows whose column fk_abob_type_actor equals 104.
Number of actors with a type other than 104: 3
| | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10340 | 59031 | Actr59031 | Forster, James | 1830.0 | 3 | 3 | 1930.0 | 3 | 3 | 1 | None | 106.0 | 81.0 | 2016-11-29 11:05:00.060 | 81.0 | 2016-11-29 11:05:00 |
| 28956 | 60660 | Actr60660 | Valjean, Jean | 1769.0 | 1 | None | 1833.0 | 1 | None | 1 | None | 106.0 | 122.0 | 2018-10-23 16:48:50.050 | 122.0 | 2018-10-23 16:48:50 |
| 46023 | 46914 | Actr46914 | Dieu (conception chrétienne) | NaN | 1 | None | NaN | None | None | 0 | None | 106.0 | 3.0 | 2013-07-04 11:43:15.990 | 3.0 | 2013-12-18 15:24:16 |
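Selecting the persons can be sketched with a boolean mask. The three rows below reuse pk_actor and type values from the tables above; the names `PERSON_TYPE`, `is_person` and `persons` are assumptions made for this example.

```python
import pandas as pd

# Three rows taken from the tables above (two persons, one other actor type).
actor = pd.DataFrame({
    "pk_actor": [14468, 14720, 46914],
    "fk_abob_type_actor": [104.0, 104.0, 106.0],
})

PERSON_TYPE = 104  # type value that the text gives for persons

is_person = actor["fk_abob_type_actor"] == PERSON_TYPE
print(f"Actors with a type other than {PERSON_TYPE}: {(~is_person).sum()}")

# .copy() avoids chained-assignment warnings when the subset is modified later.
persons = actor[is_person].copy()
```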
Columns contain:
Total number of rows: 59622

| column | empty | uniques | examples |
|---|---|---|---|
| pk_actor | 0.00% | 59622 (100.00%) | 44895; 47015 |
| concat_actr | 0.00% | 59622 (100.00%) | Actr44895; Actr47015 |
| concat_standard_name | 0.00% | 56635 (94.99%) | Sainte-Mar...; Costantino... |
| creation_time | 0.00% | 34508 (57.88%) | 2012-04-08...; 2013-07-26... |
| modification_time | 0.00% | 14053 (23.57%) | 2013-12-18...; 2016-10-21... |
| creator | 0.01% | 88 (0.15%) | 43.0; 30.0 |
| gender_iso | 0.04% | 4 (0.01%) | 1; 2 |
| modifier | 8.90% | 85 (0.14%) | 2.0; 30.0 |
| certainty_begin | 9.40% | 4 (0.01%) | 3; 1 |
| certainty_end | 14.47% | 5 (0.01%) | 3; None |
| begin_year | 18.58% | 848 (1.42%) | 1870.0; 1506.0 |
| end_year | 50.68% | 819 (1.37%) | 1930.0; 1545.0 |
| notes_begin | 67.71% | 5 (0.01%) | 3; 2 |
| notes_end | 72.40% | 6 (0.01%) | 3; 4 |
| notes | 89.83% | 6031 (10.12%) | `<p>Il s'ag...`; None |
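A per-column summary like the one above could be computed along these lines. The helper `summarize_columns` is hypothetical, not the function used in this notebook; it assumes a non-empty frame and treats only NaN/None as empty, so its counts may differ slightly from the figures reported here.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: share of empty cells and number of distinct non-empty values."""
    total = len(df)  # assumed to be > 0
    rows = []
    for name in df.columns:
        column = df[name]
        unique_count = column.nunique(dropna=True)
        rows.append({
            "column": name,
            "empty_pct": round(100 * column.isna().mean(), 2),
            "uniques": unique_count,
            "uniques_pct": round(100 * unique_count / total, 2),
        })
    # Least empty columns first, as in the report above.
    return pd.DataFrame(rows).sort_values("empty_pct").reset_index(drop=True)

# Hypothetical four-row sample.
actor = pd.DataFrame({
    "pk_actor": [1, 2, 3, 4],
    "end_year": [1673.0, None, 1881.0, None],
})
summary = summarize_columns(actor)
print(summary)
```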
Based on the summary above, we parse each column into its most meaningful type.
Below we report the interesting findings for individual columns. The analysis is not exhaustive.
For some of the columns, we also update the values.
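As an illustration of such type parsing, the sketch below converts a few columns. The mapping is an assumption (years and user ids as nullable integers, timestamps as datetimes); the text does not state which type was chosen for each column. The two sample rows reuse values from the first table.

```python
import pandas as pd

# Two rows with values taken from the first table above.
actor = pd.DataFrame({
    "pk_actor": [48893, 40238],
    "begin_year": [1510.0, 1831.0],
    "end_year": [None, 1881.0],
    "creator": [30.0, 24.0],
    "creation_time": ["2014-04-05 17:30:35.800", "2010-11-18 11:09:01.000"],
})

# Assumed mapping: whole-number floats become nullable integers ("Int64"),
# so missing years stay missing instead of forcing a float column.
for name in ["begin_year", "end_year", "creator"]:
    actor[name] = pd.to_numeric(actor[name], errors="coerce").astype("Int64")

# Unparseable timestamps would become NaT rather than raising.
actor["creation_time"] = pd.to_datetime(actor["creation_time"], errors="coerce")

print(actor.dtypes)
```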
We observe that some gender values are undefined. According to ISO/IEC 5218, the value should be 0 (not known), 1 (male), 2 (female) or 9 (not applicable), so we replace undefined genders with 0.
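The gender normalization can be sketched as below. The sample values are hypothetical, and so is the assumption that the raw column may hold strings; only the valid codes and the replacement value 0 come from the text.

```python
import pandas as pd

VALID_GENDER_CODES = {0, 1, 2, 9}  # codes listed in the text

# Hypothetical raw values, including a missing and an invalid one.
actor = pd.DataFrame({"gender_iso": ["1", "2", None, "0", "x"]})

# Non-numeric entries become NaN first.
gender = pd.to_numeric(actor["gender_iso"], errors="coerce")

# Anything missing or outside the valid set is replaced by 0.
gender = gender.where(gender.isin(VALID_GENDER_CODES), 0)
actor["gender_iso"] = gender.astype(int)

print(actor["gender_iso"].tolist())
```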
We replace missing values with 0.
We replace missing values with 0.
All HTML tags, non-ASCII characters and newlines are removed.
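One way to apply these three clean-up steps is sketched below. The helper `clean_note` and the sample string are hypothetical; the regular expression is a simple tag stripper, not a full HTML parser.

```python
import re
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")

def clean_note(text: Optional[str]) -> Optional[str]:
    """Strip HTML tags, non-ASCII characters and newlines from a note."""
    if text is None:
        return None
    text = TAG_RE.sub(" ", text)                           # drop HTML tags
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    return " ".join(text.split())                          # drop newlines, collapse spaces

# Hypothetical note, not taken from the data.
print(clean_note("<p>Né à\nGenève</p>"))
```

Note that dropping non-ASCII characters also removes accented letters from the text, so names and French notes lose characters rather than being transliterated.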